Lightweight Fault Tolerance in Large-Scale Distributed Graph Processing
Abstract
The success of Google’s Pregel framework in distributed graph processing has inspired a surge of interest in developing Pregel-like platforms featuring a user-friendly “think like a vertex” programming model. Existing Pregel-like systems support a fault tolerance mechanism called checkpointing, which periodically saves computation states as checkpoints to HDFS, so that when a failure happens, computation rolls back to the latest checkpoint. However, a checkpoint in existing systems stores a huge amount of data, including vertex states, edges, and messages sent by vertices, which significantly degrades failure-free performance. Moreover, the high checkpointing cost prevents frequent checkpointing, so recovery has to replay all the computation performed since a checkpoint that may have been taken long before the failure. In this paper, we propose a novel checkpointing approach that stores only vertex states and incremental edge updates to HDFS as a lightweight checkpoint (LWCP), so that writing an LWCP is typically tens of times faster than writing a conventional checkpoint. To recover from the latest LWCP, messages are generated from the vertex states, and graph topology is recovered by replaying incremental edge updates. We show how to realize lightweight checkpointing with minor modifications to the vertex-centric programming interface. We also apply the same idea to a recently proposed log-based approach for fast recovery, making it work efficiently in practice by significantly reducing the cost of garbage collection of logs. Extensive experiments on large real graphs verified the effectiveness of LWCP in improving both failure-free performance and recovery performance.
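The recovery idea described in the abstract can be illustrated with a toy, in-memory sketch. This is not the paper's implementation: the dictionary `store` stands in for HDFS, the function names (`write_lwcp`, `recover`, `generate_messages`, `replay_edge_updates`) are hypothetical, and the sketch assumes a PageRank-style computation in which each outgoing message is a pure function of the sender's state and its out-edges.

```python
# Toy sketch of lightweight checkpointing (LWCP). All names are hypothetical.
# Assumption: a message depends only on the sender's state and out-edges.

def generate_messages(states, edges):
    """Regenerate every vertex's inbox from vertex states alone."""
    inbox = {v: [] for v in states}
    for v, rank in states.items():
        outs = edges[v]
        for dst in outs:
            # PageRank-style message: split the rank evenly over out-edges.
            inbox[dst].append(rank / len(outs))
    return inbox

def replay_edge_updates(base_edges, edge_log):
    """Rebuild the current topology from the loaded graph plus logged updates."""
    edges = {v: list(ns) for v, ns in base_edges.items()}
    for op, src, dst in edge_log:
        if op == "add":
            edges[src].append(dst)
        elif op == "del":
            edges[src].remove(dst)
    return edges

def write_lwcp(store, superstep, states, new_edge_updates):
    """Persist only vertex states and the edge updates made since the last LWCP."""
    store["superstep"] = superstep
    store["states"] = dict(states)
    store["edge_log"].extend(new_edge_updates)

def recover(store):
    """Restore states, replay edge updates, then regenerate messages."""
    edges = replay_edge_updates(store["base_edges"], store["edge_log"])
    states = dict(store["states"])
    inbox = generate_messages(states, edges)
    return store["superstep"], states, edges, inbox

# Example: the graph as loaded, then one checkpoint with two edge mutations.
store = {
    "base_edges": {"a": ["b"], "b": ["a", "c"], "c": ["a"]},
    "edge_log": [],
    "states": None,
    "superstep": None,
}
write_lwcp(store, 3, {"a": 0.5, "b": 0.25, "c": 0.25},
           [("add", "a", "c"), ("del", "b", "c")])
step, states, edges, inbox = recover(store)
```

Note that neither the full edge lists nor the in-flight messages are written at checkpoint time. In this sketch that is what keeps the checkpoint small, at the cost of recomputing messages during recovery.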
Related papers
LOT: A Robust Overlay for Distributed Range Query Processing
Large-scale data-centric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the large-scale distribution, peer-to-peer communication and frequent disconnection. In this paper, we introduce LOT, a robust, fault-tolerant and highly distributed overlay network for lar...
BPP: Large Graph Storage for Efficient Disk Based Processing
Processing very large graphs like social networks, biological and chemical compounds is a challenging task. Distributed graph processing systems process the billion-scale graphs efficiently but incur overheads of efficient partitioning and distribution of the graph over a cluster of nodes. Distributed processing also requires cluster management and fault tolerance. In order to overcome these pr...
Lightweight Fault-tolerance for Highly Cooperative Distributed Applications
The recent introduction of high-speed networks, faster processors, and the rapid growth of heterogeneous large-scale distributed systems has enabled the development of distributed applications that move beyond the client-server model to truly harness the computational potential of distributed systems. These new applications will be structured around groups of agents that communicate using messa...
Design of Saturn Architecture Over DHT Data System
A large-scale network is one in which the automated network systems must be able to manage the network as a single entity rather than managing individual connections between networks. A large scale network consists of an overall architecture. An overlay architecture is used to improve internet routing and quality of service guarantees to achieve higher quality streaming media. The existing over...
Agent-Based Distributed Parallel Processing
Traditional solutions to large-scale signal processing involve massive supercomputers consisting of multiple processors. Data is processed in a pipelined fashion that can incorporate multiple machines and numerous computing stages. The limitations to this approach include flexibility, scalability, cost and fault tolerance. Our research is focused on a new approach to signal processing that util...
Journal: CoRR
Volume: abs/1601.06496
Publication date: 2016